Machine Learning-Driven Predictive Modeling for Opioid Use Disorder in Chronic Pain Patients on Long-Term Opioid Use

Overview

This predictive model aims to identify opioid use disorder among chronic pain patients on long-term opioid therapy by incorporating diverse risk factors, grounded in the biopsychosocial model of pain. It provides a robust predictive framework and serves as a foundation for more advanced machine learning models that will incorporate a broader range of psychiatric conditions, ultimately enhancing early identification and intervention strategies for at-risk patients. For this project, I worked with two psychology experts specializing in substance use disorders, including opioid use disorder.

  • Link to final project GitHub repository: https://github.com/yoonjae-hub/BMIN503_Final_Project.git

1. Introduction

Nearly 51.6 million Americans live with chronic pain (Rikard, Strahan et al. 2023). Managing chronic non-cancer pain (CNCP) is particularly challenging because of its complex neurobiological and psychosocial mechanisms (Fillingim, Bruehl et al. 2014), which profoundly impact patients’ physical, psychological, and social well-being (Genova, Dix et al. 2020). CNCP is defined as persistent or recurrent pain lasting three months or longer and encompasses a range of non-malignant conditions, including neuropathic pain, rheumatoid arthritis, lower back pain, osteoarthritis, and fibromyalgia. A major concern with CNCP is the ongoing use of opioids. Opioids are widely prescribed for chronic pain but can cause adverse outcomes, such as unintentional overdose, misuse, addiction, and even death (Dahlhamer, Connor et al. 2021, Dowell, Ragan et al. 2022). As many as 25% of patients treated with opioids for chronic pain develop opioid use disorder (OUD) (Cordier Scott and Lewis 2016).

Leveraging AI in healthcare can transform OUD management by analyzing large datasets to develop machine learning (ML) algorithms that identify individuals at risk. This approach enables earlier intervention and more personalized care. Recently, the FDA approved AvertD, an ML-based risk score model designed to assess genetic predisposition to OUD using a prescription-only genotyping test (U.S. Food and Drug Administration 2024). However, AvertD relies solely on buccal swab specimens, excluding significant socioeconomic, physical, and psychological factors associated with OUD. Additionally, AvertD is designed for first-time opioid users, while OUD incidence is higher among individuals on long-term opioid therapy. This highlights the need for an ML model that targets chronic opioid users and integrates broader risk factors for better accessibility and applicability. Therefore, this study aims to develop an ML model to predict OUD among chronic pain patients who have been on opioid therapy, integrating biopsychosocial factors that are known to influence OUD risk.

This problem is inherently interdisciplinary because pain, particularly in the context of chronic pain and opioid use disorder (OUD), cannot be fully understood or addressed from a single disciplinary perspective. The biopsychosocial model emphasizes that pain is influenced not only by biological factors but also by psychological and social dimensions. Therefore, incorporating experts from psychology is essential to explore the psychological underpinnings of pain, such as emotional distress, coping mechanisms, and the role of comorbid psychiatric conditions like anxiety, depression, or substance use disorders.

2. Methods

Data source This project used data collected for a larger parent study that evaluated the phenotypic and genotypic profiles of CNCP patients on opioid therapy who did and did not develop an OUD. Data were collected between November 2012 and September 2018. This study includes 1,356 patients who were receiving long-term (6 months or more) opioid therapy (LTOT) to manage CNCP. Participants were recruited from three regions of the United States (Pennsylvania, Washington, and Utah). The eligibility criteria for the parent study were: individuals who were (1) aged 18 or older, (2) Caucasian and of European descent (defined as having 3 or 4 grandparents of European descent), (3) experiencing CNCP of musculoskeletal or neuropathic origin persisting for at least 6 months, (4) having no history of substance abuse (except nicotine) before starting LTOT, and (5) having received LTOT to treat their pain for at least the past 6 months. Due to the genotypic and phenotypic objectives of the parent study, persons who were non-Caucasian or who had a previous history of substance use disorder were excluded from the study. Additional exclusion criteria included severe psychiatric conditions that hindered the ability to provide informed consent or complete the questionnaire, and pain conditions not arising from a musculoskeletal/neuropathic origin (e.g., cancer, gynecologic issues, abdominal discomfort, visceral concerns, dental problems, neuropathic pain due to metabolic disease).

Determination of OUD For this analysis, participants were grouped as either ‘cases’ or ‘controls’ based on whether or not they developed OUD after initiating LTOT. The control group included patients who did not have evidence of opioid abuse or meet the criteria for OUD at the time of assessment. Their electronic health records (EHR) were reviewed monthly for 12 consecutive months after enrolling in the study to ensure they did not develop an OUD after completing the baseline assessment. To be considered controls, patients receiving LTOT needed to have negative urine drug screens (which detect opioid metabolites and other illicit drugs), have no record of current or past SUD, and have evidence of no aberrant drug-related behaviors assessed using an expert developed checklist (Cheatle, O’Brien et al. 2013). Cases were patients who did not have a history of SUD (except nicotine) prior to beginning LTOT and currently met DSM-IV criteria for ‘opioid dependence’, which were obtained at the time of enrolling in the study.

Feature Selection Given the biopsychosocial aspects of OUD, the initial analysis will explore variables such as demographic factors (age, sex, race, ethnicity, marital status, education, employment, financial situation), pain (severity, interference), nicotine dependence, psychiatric conditions (depression, anxiety, pain catastrophizing, mental defeat, suicidality), and social support. To check multicollinearity among these variables, variance inflation factor (VIF) was used.
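As a rough illustration of the VIF step, the sketch below fits a logistic model on simulated data and checks `car::vif()`. The variable names (`age`, `pain_sev`, `pain_int`, `oud`) are placeholders for illustration only, not the actual study columns.

```r
# Hedged sketch of the multicollinearity check; variables are simulated
# placeholders, not the real dataset columns.
library(car)

set.seed(1)
toy <- data.frame(
  age      = rnorm(200, mean = 55, sd = 10),
  pain_sev = rnorm(200, mean = 20, sd = 5),
  pain_int = rnorm(200, mean = 33, sd = 8)
)
toy$oud <- rbinom(200, 1, plogis(-2 + 0.05 * toy$pain_sev))

fit <- glm(oud ~ age + pain_sev + pain_int, data = toy, family = binomial)
vif(fit)  # rule of thumb: values above ~5-10 suggest problematic collinearity
```

With independently simulated predictors, all VIF values sit near 1; correlated predictors in the real data would push them higher.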

Implementation of ML This study will use a supervised learning approach to develop the predictive model. Given the binary outcome, algorithms such as logistic regression, random forest, and gradient boosting will be employed. To compare performance, each algorithm will be evaluated using the area under the receiver operating characteristic curve (AUC-ROC). A potential challenge in this approach is missing data, as some variables rely on total scores; a single missing item can invalidate an entire total score, leading to more extensive data gaps. To address this, missing-data imputation techniques will be considered. Also, to ensure model generalization and detect potential overfitting/underfitting, cross-validation (k-fold cross-validation) will be applied.
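A minimal sketch of the planned evaluation loop, using simulated placeholder variables (`x1`, `x2`, `y`) in place of the study predictors: a logistic-regression workflow scored by AUC-ROC under 5-fold cross-validation with tidymodels. The other candidate algorithms would slot in by swapping the parsnip model spec (e.g., `rand_forest()`, `boost_tree()`).

```r
# Sketch of the CV/AUC comparison on simulated data; the real model will
# use the study's biopsychosocial predictors.
library(tidymodels)

set.seed(42)
dat <- tibble(
  x1 = rnorm(300),
  x2 = rnorm(300),
  y  = factor(rbinom(300, 1, plogis(x1)))  # outcome driven by x1
)

folds <- vfold_cv(dat, v = 5)            # 5-fold cross-validation

log_spec <- logistic_reg() %>% set_engine("glm")
wf <- workflow() %>%
  add_formula(y ~ x1 + x2) %>%
  add_model(log_spec)

cv_res <- fit_resamples(wf, resamples = folds,
                        metrics = metric_set(roc_auc))
collect_metrics(cv_res)  # mean AUC-ROC across the 5 folds
```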

2.1 Loading Packages

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # For graphs
library(lubridate) # For manipulating dates
library(gtsummary) #Summary statistics
library(RColorBrewer) # For coloring the plots
library(cowplot) # For combining multiple plots into one

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggpubr) # Combining multiple graphs into one figure 

Attaching package: 'ggpubr'

The following object is masked from 'package:cowplot':

    get_legend
library(pROC) # For ROC curve analysis
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var
library(dplyr) # For data cleaning
library(haven)
library(rsample)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ recipes      1.1.0
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ car::recode()     masks dplyr::recode()
✖ car::some()       masks purrr::some()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(dotwhisker)
library(parsnip)

2.2 Variables

#Predictors
#Demographic - gender (DEM002), age (DEM003), marital status (DEM004), living alone (DEM005), education (DEM006), employment (DEM007), financial status (DEM008)
#Please note that due to the genotypic and phenotypic objectives of the parent study, persons who were non-Caucasian or had a previous history of substance use disorder were excluded from the study.

demp_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/demp_v2.sav")

#Pain - severity, interference - Brief Pain Inventory
bpip <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/bpip.sav")

#Smoking- Fagerstrom scale
fnts<- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/fnts.sav")

#Alcohol
aadidm <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/aadidm.sav")

#Anxiety/Depression-PHQ-4
phq4 <-read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/phq4.sav")

#Pain catastrophizing (Subscale) - Coping strategies questionnaire
copsq<- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/copsq.sav")

#Mental defeat-Pain perception scale
pps <-read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pps.sav")

#Suicidality - SBQ-R
sbqr <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/sbqr.sav")

#Social support - Duke social support index
dssi_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/dssi_v2.sav")

#Outcome - Opioid Use Disorder
pevc <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pevc.sav")

2.3 Cleaning and Transforming Data for Analysis

#Demographic - gender, age, marital status, living status, education, employment, financial status
demo <-demp_v2 %>%
  select(DEM002, DEM003, DEM004, DEM005, DEM006, DEM007, DEM008, SubjID)
demo <- demo %>%
  mutate(across(
    where(~inherits(., "haven_labelled")), 
    as_factor
  ))
# (the step above converts all labelled columns to factors; DEM003 - age - is converted back to numeric below)

demo$DEM003[which(demo$DEM003 == "(X)Subject refused to answer")] <- NA
demo$DEM005[which(demo$DEM005 == "Do not know/refused")] <- NA
demo$DEM005 <- droplevels(demo$DEM005)
levels(demo$DEM005)
[1] "Live alone"                 "One other person"          
[3] "More than one other person"
demo$DEM006[which(demo$DEM006 == "(X)Subject refused to answer")] <- NA
demo$DEM006 <- droplevels(demo$DEM006)
levels(demo$DEM006)
[1] "Grade 6 or less"                                
[2] "Grade 7 to 12"                                  
[3] "Graduated high school or high school equivalent"
[4] "Part college"                                   
[5] "Graduated 2 years college"                      
[6] "Graduated 4 years college"                      
[7] "Part graduate/professional school"              
[8] "Completed graduate/professional school"         
demo$DEM007 <- as.character(demo$DEM007)
demo$DEM007[which(demo$DEM007 == "Do not know/refused")] <- NA
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM007 <- droplevels(demo$DEM007)
levels(demo$DEM007)
[1] "Yes" "No" 
demo$DEM008 <- as.character(demo$DEM008)
demo$DEM008[which(demo$DEM008 == "Do not know/Refused")] <- NA
demo$DEM008 <- as_factor(demo$DEM008)
demo$DEM008 <- droplevels(demo$DEM008)
levels(demo$DEM008)
[1] "Ca not make ends meet"         "Have just enough to get along"
[3] "Are comfortable"              
#transforming to factor (all, except DEM003 - age)
demo$DEM002 <- as_factor(demo$DEM002)
demo$DEM003 <- as.numeric(as.character(demo$DEM003)) # via character so factor levels are not coerced to codes
demo$DEM004 <- as_factor(demo$DEM004)
demo$DEM005 <- as_factor(demo$DEM005)
demo$DEM006 <- as_factor(demo$DEM006)
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM008 <- as_factor(demo$DEM008)


#checking (examples)
class(demo$DEM004)       # Should show "factor"
[1] "factor"
levels(demo$DEM005)      # 'Do not know/refused' level removed
[1] "Live alone"                 "One other person"          
[3] "More than one other person"
is.numeric(demo$DEM005)  # Should return FALSE
[1] FALSE
#pain
#pain-severity
pain_severity <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_severity_sum = if_else(if_any(c(BPI001, BPI002, BPI003, BPI004), ~ . == -9),
                                     NA_real_,
                                     BPI001 + BPI002 + BPI003 + BPI004)) %>%
  select(pain_severity_sum, SubjID)

#pain-interference
pain_interf <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_interf_sum = if_else(if_any(c(BPI007A, BPI007B, BPI007C, BPI007D, BPI007E, BPI007G), ~ . == -9),
                                   NA_real_, # changing -9 to NA value
                                   BPI007A + BPI007B + BPI007C + BPI007D + BPI007E + BPI007G)) %>%
  select(pain_interf_sum, SubjID)

#smoking: conditional survey (if the FNTS000 entry question is 0, the remaining questions are system-missing; FNTS000 has no missing values)
missing_fnts <- sum(is.na(fnts$FNTS000))
missing_fnts #for the initial screening question, there is no missing data.
[1] 0
# 001, 004: Multiple-choice items (0-3) / # 2,3,5,6: Yes/No items (0-1)
fnts_cleaned <- fnts %>%
  mutate(
    fnts_sum = ifelse(FNTS000 == 0, NA,
                         FNTS001 + FNTS004 + 
                         FNTS002 + FNTS003 + FNTS005 + FNTS006)) %>%
  replace_na(list(fnts_sum=0))

fnts_cleaned <- fnts_cleaned %>%
  filter(month==0)

fnts_cleaned <- fnts_cleaned %>%
  mutate(dependence_level = case_when(
    fnts_sum == 0 ~ "Non-smoker",
    fnts_sum >= 1 & fnts_sum <= 2 ~ "Low",
    fnts_sum >= 3 & fnts_sum <= 4 ~ "Low to Moderate",
    fnts_sum >= 5 & fnts_sum <= 7 ~ "Moderate",
    fnts_sum >= 8 ~ "High")) %>%
  select(SubjID, fnts_sum, dependence_level)

fnts_cleaned_2 <- fnts_cleaned %>%
  select(SubjID, dependence_level) #sort out only two variables

#Alcohol use - more conservative, in the last 3 months
aadidm_cleaned <- aadidm %>%
  filter(month == 0) %>%
  mutate(
    alcohol = ifelse(AAD001 == 1, 1, 0),
    alcohol = replace(alcohol, is.na(alcohol), 0)  # Replace NA with 0
  ) %>%
  select(SubjID, alcohol)
table(aadidm_cleaned$alcohol)

  0   1 
807 517 
#PHQ-4 (Anxiety/Depression)
phq4 <- phq4 %>%
  filter (month == 0)
#PHQ-4: dichotomizing Anxiety, cutoff >=3
#PHQ-4: dichotomizing Depression, cutoff >=3
phq4 <- phq4 %>%
  mutate(Anxiety = ifelse(PHQ001 + PHQ002 >= 3, 1, 0),
         Depression = ifelse(PHQ003 + PHQ004 >= 3, 1, 0))
phq4_cleaned <- phq4 %>%
  select(Anxiety, Depression, SubjID)

phq4_cleaned$Anxiety <- as.factor(phq4_cleaned$Anxiety)
phq4_cleaned$Depression <- as.factor(phq4_cleaned$Depression)

#CSQ - Pain catastrophizing subscale only (score already calculated in the original dataset: CATAT)
ca_catastro <- copsq %>%
  filter(month == 0) %>%
  mutate(ca_sum = if_else(if_any(c(COP005, COP012, COP014, COP028, COP038, COP042), ~ . == -9),
                                   NA_real_,
                                   COP005 + COP012 + COP014 + COP028 + COP038 + COP042)) %>%
  select(ca_sum,SubjID)

##PSPS - has some missing values (pps_sum)
pps_cleaned <- pps %>%
  mutate(pps_sum = rowSums(across(starts_with("PPS"), ~ ifelse(. == -9, NA, .)), na.rm = FALSE))
pps_cleaned <- pps_cleaned%>%
  select(pps_sum, SubjID)

missing_pps <- sum(is.na(pps_cleaned$pps_sum))
missing_pps #n=85
[1] 85
#Suicidality: cutoff >=8 (higher risk) for clinical samples. There are no missing values, but '-9' means refused to answer, so the total score is set to NA whenever a -9 is present in any of the specified columns (only 6 NA total scores).
#scoring website: https://www.ncbi.nlm.nih.gov/books/NBK571017/box/ch3.b11/?report=objectonly

sbqr_cleaned <- sbqr %>%
  filter(month==0) %>%
  mutate(
    SBQ001 = case_when(
      SBQ001 %in% c(3, 4) ~ 3, #regrouping
      SBQ001 %in% c(5, 6) ~ 4,
      TRUE ~ SBQ001),
    SBQ003 = case_when(
      SBQ003 %in% c(2, 3) ~ 2,
      SBQ003 %in% c(4, 5) ~ 3,
      TRUE ~ SBQ003))

sbqr_cleaned <- sbqr_cleaned %>%
     mutate(sbqr_sum = if_else(
      rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004) == -9) > 0,
      NA_real_,
      rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004), na.rm = TRUE))) %>%
  select(SubjID, sbqr_sum)

#Categorize suicidality risk based on sbqr_sum (>=8: high risk, <8: low risk, NA -> NA); missing data: n=6
sbqr_cleaned <- sbqr_cleaned %>%
  mutate(
    suicidality_risk = case_when(
      sbqr_sum >= 8 ~ "High Risk",
      sbqr_sum < 8 ~ "Low Risk",
      is.na(sbqr_sum) ~ NA_character_))

suicidality <- sbqr_cleaned %>%
  select(SubjID, suicidality_risk)

#social support (DSSI) : 3 subscales scores
dssi_cleaned <- dssi_v2 %>%
  filter(month==0) %>%
  select(SubjID, SIDSSI, SSDSSI, ISDSSI)
#social interaction (SIS, 4 item) : assessing the amount of time: 3 point likert scale (missing data: NONE)
## dssi_cleaned$SIDSSI

#subjective social support (SSS, 7 items): assessing depth or quality of relationship: 3point (missing data: 14)
## dssi_cleaned$SSDSSI

#Instrumental social support (ISS, 12 items): degree to which others can be counted upon with daily activities;0/1; (missing data: 4)
## dssi_cleaned$ISDSSI

#outcome: opioid use disorder (cohort - 1: case=OUD, 0: control=no OUD)
cohort <- read.csv("/Users/yoonjaelee/Desktop/Dr.CHEATLE ML PROJECT/pevc_compiled.csv")
cohort_cleaned <- cohort %>%
  select(SubjID,cohort)
cohort_cleaned$cohort <- as.factor(cohort_cleaned$cohort) #changing the class to categorical values.
class(cohort_cleaned$cohort) #factor
[1] "factor"
#removing 16 NAs
cohort_cleaned <- cohort_cleaned[!is.na(cohort_cleaned$cohort), ]
cohort_cleaned <- cohort_cleaned %>%
  filter(SubjID != 0) #remove SubjID==0
table(cohort_cleaned$cohort)  #cohort 0:1042; 1:486

   0    1 
1042  486 
##Merging data_using left_join
merging1 <- left_join(cohort_cleaned, demo, by = "SubjID")
## Note: when merging the demo file with the cohort_cleaned file, patients with missing demographic data (not registered for the study) were removed (n = 173)
merging1 <- merging1 %>%
  drop_na()
merging2 <- left_join(merging1, pain_severity, by = "SubjID")
merging3 <- left_join(merging2, pain_interf, by = "SubjID")
merging4 <- left_join(merging3, fnts_cleaned_2, by = "SubjID")
merging5 <- left_join(merging4, phq4_cleaned, by = "SubjID")
merging6 <- left_join(merging5, ca_catastro, by = "SubjID")
merging7 <- left_join(merging6, pps_cleaned, by = "SubjID")
merging8 <- left_join(merging7, suicidality, by = "SubjID")
merging9 <- left_join(merging8, aadidm_cleaned, by = "SubjID")
final_data <- left_join(merging9, dssi_cleaned, by = "SubjID") #final sample N= 1331
table(final_data$cohort) #901/430

  0   1 
901 430 

2.4 Exploratory Data Analysis (Distribution/Association)

Based on the distribution analysis, most variables were skewed. However, given the large sample size, we can rely on the Central Limit Theorem to approximate normality.
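One lightweight way to quantify the skew noted above is a moment-based estimate (packages such as e1071 or moments provide equivalents); this is a sketch on simulated distributions, not the study variables.

```r
# Hand-rolled skewness estimate (sketch): positive = right-skewed,
# near zero = roughly symmetric.
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

set.seed(7)
skew(rexp(1000))   # right-skewed distribution -> clearly positive
skew(rnorm(1000))  # symmetric distribution -> near zero
```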

#Categorical variables
#1-1) Fnts - Visualization
fnts_eda <- left_join(cohort_cleaned, fnts_cleaned, by = "SubjID") %>%
  filter(!is.na(dependence_level), !is.na(cohort))

ggplot(fnts_eda, aes(x = dependence_level, fill = as.factor(cohort))) +
  geom_bar(position = "dodge") +
  labs(x = "Dependence Level", y = "Count", fill = "OUD Status") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()

oud_smoking <- table(final_data$cohort, final_data$dependence_level)
print(oud_smoking) # All cells >5. okay to perform chi-square test
   
    High Low Low to Moderate Moderate Non-smoker
  0   20  54              47       59        694
  1   43  49              75      123        109
chi_oud_smoking <- chisq.test(oud_smoking)
print(chi_oud_smoking) # p-value < 2.2e-16 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_smoking
X-squared = 332.86, df = 4, p-value < 2.2e-16
#2-1) phq4 - Distribution
phq4_eda <- left_join(cohort_cleaned, phq4_cleaned, by = "SubjID") %>%
  filter(!is.na(Anxiety), !is.na(Depression))

ggplot(phq4_eda, aes(x = Anxiety, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Anxiety", y = "Count", fill = "OUD Status") +
   scale_fill_discrete(labels = c("Control", "Case")) +
   scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
    theme_minimal()

ggplot(phq4_eda, aes(x = Depression, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Depression", y = "Count", fill = "OUD Status") +
   scale_fill_discrete(labels = c("Control", "Case")) +
   scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
    theme_minimal()

#2-2) phq4 - Association
oud_anxiety <- table(final_data$cohort, final_data$Anxiety)
print(oud_anxiety) # expected counts >5. okay to perform chi-square test
   
      0   1
  0 473 396
  1 144 248
chi_oud_anxiety <- chisq.test(oud_anxiety, correct=F)
print(chi_oud_anxiety) # p-value = 5.947e-09 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_anxiety
X-squared = 33.852, df = 1, p-value = 5.947e-09
oud_dep <- table(final_data$cohort, final_data$Depression)
print(oud_dep) # expected counts >5. okay to perform chi-square test
   
      0   1
  0 517 352
  1 156 236
chi_oud_dep <- chisq.test(oud_dep, correct=F)
print(chi_oud_dep) # p-value = 8.595e-11 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_dep
X-squared = 42.117, df = 1, p-value = 8.595e-11
#3-1) SBQ-R - Distribution
sbqr_eda <- left_join(cohort_cleaned, sbqr_cleaned, by = "SubjID")%>%
  filter(!is.na(suicidality_risk))

ggplot(sbqr_eda, aes(x = suicidality_risk, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Suicidality risk", y = "Count", fill = "OUD Status") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#3-2) SBQ-R - Association
oud_sui <- table(final_data$cohort, final_data$suicidality_risk)
print(oud_sui) # expected counts >5. okay to perform chi-square test
   
    High Risk Low Risk
  0        58      429
  1        77      161
chi_oud_sui <- chisq.test(oud_sui, correct=F)
print(chi_oud_sui) # p-value = 3.133e-11 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_sui
X-squared = 44.092, df = 1, p-value = 3.133e-11
#Continuous variables 
#1-1)Pain severity - Visualization
ps_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(pain_severity_sum))

ggplot(ps_data_noNA, aes(x = as.factor(cohort), y = pain_severity_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Severity Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(ps_data_noNA, aes(x = pain_severity_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Severity Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

#1-2)Pain severity - Association
ps.nonoud <- ps_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pain_severity_sum)

ps.oud <- ps_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pain_severity_sum)

var.test(ps.nonoud$pain_severity_sum, ps.oud$pain_severity_sum) #p-value = 0.002074 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
F = 0.7723, num df = 875, denom df = 399, p-value = 0.002074
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6513741 0.9108211
sample estimates:
ratio of variances 
         0.7723014 
ps_ttest <- t.test(ps.nonoud$pain_severity_sum, ps.oud$pain_severity_sum)
print(ps_ttest) #statistically different.

    Welch Two Sample t-test

data:  ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
t = 5.222, df = 690.86, p-value = 2.345e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.583093 3.490811
sample estimates:
mean of x mean of y 
 21.72945  19.19250 
#2-1)pain interference - Visualization
pi_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(pain_interf_sum))


ggplot(pi_data_noNA, aes(x = as.factor(cohort), y = pain_interf_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Interference Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(pi_data_noNA, aes(x = pain_interf_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Interference Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#2-2)pain interference - Association
pi.nonoud <- pi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pain_interf_sum)

pi.oud <- pi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pain_interf_sum)

var.test(pi.nonoud$pain_interf_sum, pi.oud$pain_interf_sum) # p-value = 0.7834 -> equal variance

    F test to compare two variances

data:  pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
F = 0.9773, num df = 826, denom df = 390, p-value = 0.7834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8219306 1.1559193
sample estimates:
ratio of variances 
         0.9773002 
pi_ttest <- t.test(pi.nonoud$pain_interf_sum, pi.oud$pain_interf_sum, var.equal = T)
print(pi_ttest) #statistically different.

    Two Sample t-test

data:  pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
t = 2.168, df = 1216, p-value = 0.03035
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.194941 3.906501
sample estimates:
mean of x mean of y 
 33.58525  31.53453 
#3-1) pain catastrophizing - Distribution
pc_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(ca_sum))

ggplot(pc_data_noNA, aes(x = as.factor(cohort), y = ca_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Catastrophizing Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(pc_data_noNA, aes(x = ca_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Catastrophizing", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#3-2) pain catastrophizing - Association
pc.nonoud <- pc_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,ca_sum)

pc.oud <- pc_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, ca_sum)

var.test(pc.nonoud$ca_sum, pc.oud$ca_sum) #p-value = 0.006322 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  pc.nonoud$ca_sum and pc.oud$ca_sum
F = 1.2806, num df = 836, denom df = 366, p-value = 0.006322
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.072940 1.519366
sample estimates:
ratio of variances 
           1.28062 
pc_ttest <- t.test(pc.nonoud$ca_sum, pc.oud$ca_sum)
print(pc_ttest) #statistically different.

    Welch Two Sample t-test

data:  pc.nonoud$ca_sum and pc.oud$ca_sum
t = -7.9363, df = 784.18, p-value = 7.194e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.699746 -2.835854
sample estimates:
mean of x mean of y 
 14.26762  18.03542 
#4-1)PSPS - Mental defeat - Distribution
psps_data_noNA <- final_data %>%
  filter(!is.na(pps_sum))

ggplot(psps_data_noNA, aes(x = as.factor(cohort), y = pps_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Mental defeat", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(psps_data_noNA, aes(x = pps_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Mental Defeat", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#4-2) PSPS - Mental defeat - Association
md.nonoud <- psps_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pps_sum)

md.oud <- psps_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pps_sum)

var.test(md.nonoud$pps_sum, md.oud$pps_sum) #p-value = 0.3163 -> equal variance -> Student's t-test

    F test to compare two variances

data:  md.nonoud$pps_sum and md.oud$pps_sum
F = 1.1209, num df = 582, denom df = 225, p-value = 0.3163
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.896521 1.386564
sample estimates:
ratio of variances 
          1.120889 
md_ttest <- t.test(md.nonoud$pps_sum, md.oud$pps_sum, var.equal=T)
print(md_ttest) #statistically different.

    Two Sample t-test

data:  md.nonoud$pps_sum and md.oud$pps_sum
t = -6.6498, df = 807, p-value = 5.407e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.371066  -9.997216
sample estimates:
mean of x mean of y 
 31.39108  45.57522 
#5-1) DSSI - Distribution (SIDSSI, SSDSSI, ISDSSI (all continuous))
dssi_data_noNA <- final_data %>%
  filter(!is.na(SIDSSI), !is.na(SSDSSI), !is.na(ISDSSI))

ggplot(dssi_data_noNA, aes(x = SIDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "SIDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

ggplot(dssi_data_noNA, aes(x = SSDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "SSDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

ggplot(dssi_data_noNA, aes(x = ISDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "ISDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#5-2) DSSI - Association
#5-2-1) SIDSSI
sidssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,SIDSSI)

sidssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, SIDSSI)

var.test(sidssi.nonoud$SIDSSI, sidssi.oud$SIDSSI) #p-value = 0.5488 -> equal variances -> pooled (Student's) t-test

    F test to compare two variances

data:  sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
F = 0.94976, num df = 838, denom df = 375, p-value = 0.5488
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7970469 1.1252390
sample estimates:
ratio of variances 
         0.9497572 
sidssi_ttest <- t.test(sidssi.nonoud$SIDSSI, sidssi.oud$SIDSSI, var.equal=T)
print(sidssi_ttest) #statistically NOT different.

    Two Sample t-test

data:  sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
t = 0.2859, df = 1213, p-value = 0.775
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2432314  0.3262139
sample estimates:
mean of x mean of y 
 5.193087  5.151596 
#5-2-2) SSDSSI
ssdssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,SSDSSI)

ssdssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, SSDSSI)

var.test(ssdssi.nonoud$SSDSSI, ssdssi.oud$SSDSSI) #p-value = 0.006448 -> unequal variances -> Welch's t-test

    F test to compare two variances

data:  ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
F = 0.79046, num df = 838, denom df = 375, p-value = 0.006448
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6633604 0.9365058
sample estimates:
ratio of variances 
         0.7904571 
ssdssi_ttest <- t.test(ssdssi.nonoud$SSDSSI, ssdssi.oud$SSDSSI)
print(ssdssi_ttest) #statistically different.

    Welch Two Sample t-test

data:  ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
t = 9.7284, df = 651.18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.724496 2.596701
sample estimates:
mean of x mean of y 
 17.94517  15.78457 
#5-2-3) ISDSSI
isdssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,ISDSSI)

isdssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, ISDSSI)

var.test(isdssi.nonoud$ISDSSI, isdssi.oud$ISDSSI) #p-value = 3.597e-06 -> unequal variances -> Welch's t-test

    F test to compare two variances

data:  isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
F = 0.67218, num df = 838, denom df = 375, p-value = 3.597e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5641025 0.7963774
sample estimates:
ratio of variances 
         0.6721818 
isdssi_ttest <- t.test(isdssi.nonoud$ISDSSI, isdssi.oud$ISDSSI)
print(isdssi_ttest) #statistically different.

    Welch Two Sample t-test

data:  isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
t = 3.294, df = 610.18, p-value = 0.001045
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2609288 1.0314215
sample estimates:
mean of x mean of y 
 8.702026  8.055851 

2.4.1 Descriptive Analysis

Sample characteristics

Males are overrepresented in the OUD group (53.7% vs. 33.2% in the non-OUD group). Individuals with OUD are generally younger. Being married or partnered is more common among non-OUD individuals, who also tend to have higher levels of education. Financial stability is less common among individuals with OUD.

library(Hmisc)

Attaching package: 'Hmisc'
The following object is masked from 'package:parsnip':

    translate
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
library(table1)

Attaching package: 'table1'
The following objects are masked from 'package:Hmisc':

    label, label<-, units
The following objects are masked from 'package:base':

    units, units<-
# Preprocessing: Filter out rows where "Refused to answer" is present
final_data_clean <- final_data[!(final_data$DEM002 == "(X)Subject refused to answer" | 
                                 final_data$DEM004 == "(X)Subject refused to answer"), ]
final_data_clean <- droplevels(final_data_clean)

# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))

# Assigning labels to variables for better readability
label(final_data_clean$DEM002) <- "Gender"
label(final_data_clean$DEM003) <- "Age (years)"
label(final_data_clean$DEM004) <- "Marital Status"
label(final_data_clean$DEM005) <- "Household"
label(final_data_clean$DEM006) <- "Education"
label(final_data_clean$DEM007) <- "Employment"
label(final_data_clean$DEM008) <- "Financial Status"

# Print demographic characteristics
cat("Demographic characteristics in non-OUD and OUD groups")
Demographic characteristics in non-OUD and OUD groups
# Creating the table
table1(~ DEM002 + DEM003 + DEM004 + DEM005 + DEM006 + DEM007 + DEM008 | cohort, 
       data = final_data_clean, rowlabelhead = "Demographics")
Demographics                                        Non-OUD (N=901)    OUD (N=430)        Overall (N=1331)
Gender
  Male                                              299 (33.2%)        231 (53.7%)        530 (39.8%)
  Female                                            602 (66.8%)        199 (46.3%)        801 (60.2%)
Age (years)
  Mean (SD)                                         38.2 (12.7)        24.9 (11.1)        33.9 (13.7)
  Median [Min, Max]                                 39.0 [2.00, 69.0]  22.0 [5.00, 57.0]  35.0 [2.00, 69.0]
Marital Status
  Married/Partnered                                 539 (59.8%)        98 (22.8%)         637 (47.9%)
  Separated                                         28 (3.1%)          33 (7.7%)          61 (4.6%)
  Divorced                                          154 (17.1%)        77 (17.9%)         231 (17.4%)
  Never married                                     145 (16.1%)        210 (48.8%)        355 (26.7%)
  Widowed                                           35 (3.9%)          12 (2.8%)          47 (3.5%)
Household
  Live alone                                        149 (16.5%)        80 (18.6%)         229 (17.2%)
  One other person                                  407 (45.2%)        116 (27.0%)        523 (39.3%)
  More than one other person                        345 (38.3%)        234 (54.4%)        579 (43.5%)
Education
  Grade 6 or less                                   3 (0.3%)           2 (0.5%)           5 (0.4%)
  Grade 7 to 12                                     49 (5.4%)          77 (17.9%)         126 (9.5%)
  Graduated high school or high school equivalent   183 (20.3%)        167 (38.8%)        350 (26.3%)
  Part college                                      233 (25.9%)        103 (24.0%)        336 (25.2%)
  Graduated 2 years college                         120 (13.3%)        29 (6.7%)          149 (11.2%)
  Graduated 4 years college                         145 (16.1%)        28 (6.5%)          173 (13.0%)
  Part graduate/professional school                 44 (4.9%)          7 (1.6%)           51 (3.8%)
  Completed graduate/professional school            124 (13.8%)        17 (4.0%)          141 (10.6%)
Employment
  Yes                                               230 (25.5%)        120 (27.9%)        350 (26.3%)
  No                                                671 (74.5%)        310 (72.1%)        981 (73.7%)
Financial Status
  Ca not make ends meet                             176 (19.5%)        216 (50.2%)        392 (29.5%)
  Have just enough to get along                     416 (46.2%)        179 (41.6%)        595 (44.7%)
  Are comfortable                                   309 (34.3%)        35 (8.1%)          344 (25.8%)
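The group differences visible in the table can also be checked formally. A minimal sketch (assuming the final_data_clean data frame built above), using chi-square tests for categorical demographics and a Welch t-test for age:

```r
# Hypothetical quick checks on the demographic table above
chisq.test(table(final_data_clean$DEM002, final_data_clean$cohort))  # gender vs. cohort
chisq.test(table(final_data_clean$DEM006, final_data_clean$cohort))  # education vs. cohort
t.test(DEM003 ~ cohort, data = final_data_clean)                     # age by cohort (Welch)
```

These are exploratory checks only; the formal predictor associations are assessed in the sections above and below.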

Predictors - descriptive analysis (table not generated due to missingness in the predictors)

# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))

# Assigning labels to variables for better readability
label(final_data_clean$pain_severity_sum) <- "Pain severity"
label(final_data_clean$pain_interf_sum) <-  "Pain interference"
label(final_data_clean$dependence_level) <- "Nicotine dependence"
label(final_data_clean$Anxiety) <- "Anxiety"
label(final_data_clean$Depression) <- "Depression"
label(final_data_clean$ca_sum) <- "Pain Catastrophizing"
label(final_data_clean$pps_sum) <- "Mental defeat"
label(final_data_clean$suicidality_risk) <- "Suicidality"
label(final_data_clean$SIDSSI) <- "Social interaction"
label(final_data_clean$SSDSSI) <- "Satisfaction with social support"
label(final_data_clean$ISDSSI) <- "Social support"

# Print demographic characteristics
cat("Clinical and Psychosocial Characteristics Stratified by OUD Status")
Clinical and Psychosocial Characteristics Stratified by OUD Status

2.5 Developing ML models

Given the skewness of the dataset, three algorithms were selected for their effectiveness in handling skewed variable distributions: Logistic Regression, Random Forest, and XGBoost.
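The three algorithms can be declared up front as tidymodels/parsnip model specifications. A sketch; the ranger and xgboost engines are assumptions (only the glm logistic regression spec appears verbatim later):

```r
library(parsnip)

# Logistic regression (base R glm engine)
lr_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# Random forest (ranger engine, assumed)
rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# Gradient-boosted trees (xgboost engine, assumed)
xgb_spec <- boost_tree(trees = 500) |>
  set_engine("xgboost") |>
  set_mode("classification")
```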

2.5.1 Missing data Imputation

Before developing the models, missing data needs to be addressed. For this, random forest imputation was used. This method accounts for the distribution of each variable and imputes values accordingly, rather than relying on the mean, median, or mode. For continuous predictors, the imputed value is the proximity-weighted average of the non-missing observations. For categorical predictors, the imputed value is the category with the highest average proximity. While most variables had less than 10% missing data, the mental defeat score (pps_sum) and suicidality risk showed substantial missingness. Because these variables are clinically important factors associated with the outcome (OUD), they were nevertheless included in the prediction model.
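To illustrate the mechanics, here is a toy sketch of rfImpute on the built-in iris data with artificially introduced missingness (not the project data):

```r
library(randomForest)

set.seed(42)
toy <- iris
toy$Sepal.Length[sample(nrow(toy), 15)] <- NA   # knock out 10% of one column

# Grows a forest on Species ~ . and replaces each NA with the
# proximity-weighted average of the non-missing values (continuous case)
toy_imputed <- rfImpute(Species ~ ., data = toy, iter = 5, ntree = 300)
sum(is.na(toy_imputed))   # 0: all values imputed
```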

To evaluate whether these factors can be reliably incorporated and whether the imputation was performed appropriately, I will compare models with and without these variables. This analysis is discussed in the section titled “Model Selection With Missing Data Imputation.”

library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
##Missing data count when merged
colSums(is.na(final_data))
           SubjID            cohort            DEM002            DEM003 
                0                 0                 0                 0 
           DEM004            DEM005            DEM006            DEM007 
                0                 0                 0                 0 
           DEM008 pain_severity_sum   pain_interf_sum  dependence_level 
                0                55               113                58 
          Anxiety        Depression            ca_sum           pps_sum 
               70                70               127               522 
 suicidality_risk           alcohol            SIDSSI            SSDSSI 
              606                77               100               114 
           ISDSSI 
              103 
# Missing data imputation, using randomforest package
str(final_data)
'data.frame':   1331 obs. of  21 variables:
 $ SubjID           : num  1000 1001 1002 1003 1004 ...
 $ cohort           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ DEM002           : Factor w/ 3 levels "(X)Subject refused to answer",..: 3 3 2 3 3 3 2 2 3 2 ...
  ..- attr(*, "label")= chr "What is your gender"
 $ DEM003           : num  35 32 28 42 27 39 24 42 39 39 ...
 $ DEM004           : Factor w/ 6 levels "(X)Subject refused to answer",..: 2 2 4 2 2 2 5 2 4 2 ...
  ..- attr(*, "label")= chr "Current marital status"
 $ DEM005           : Factor w/ 3 levels "Live alone","One other person",..: 1 2 1 2 2 2 1 2 1 3 ...
 $ DEM006           : Factor w/ 8 levels "Grade 6 or less",..: 1 4 3 5 6 7 6 2 5 3 ...
 $ DEM007           : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 1 2 ...
 $ DEM008           : Factor w/ 3 levels "Ca not make ends meet",..: 1 2 1 2 3 2 2 3 3 1 ...
 $ pain_severity_sum: num  NA 32 25 12 13 20 22 23 22 25 ...
 $ pain_interf_sum  : num  NA 48 50 3 15 39 54 14 55 35 ...
 $ dependence_level : chr  "Moderate" "Non-smoker" "Non-smoker" "Non-smoker" ...
 $ Anxiety          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 1 ...
 $ Depression       : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 2 1 ...
 $ ca_sum           : num  22 9 18 5 12 16 18 0 30 21 ...
 $ pps_sum          : num  NA NA NA 10 44 NA 26 NA NA NA ...
 $ suicidality_risk : chr  NA NA NA NA ...
 $ alcohol          : num  0 NA NA NA NA NA 1 0 1 0 ...
 $ SIDSSI           : num  NA 5 4 10 3 4 5 9 5 9 ...
  ..- attr(*, "label")= chr "Total score for Social Interaction"
  ..- attr(*, "format.spss")= chr "F8.2"
 $ SSDSSI           : num  NA 13 9 21 21 17 15 21 10 21 ...
  ..- attr(*, "label")= chr "Total score for Subjective Support"
  ..- attr(*, "format.spss")= chr "F8.2"
 $ ISDSSI           : num  NA 8 10 6 12 9 10 9 0 11 ...
  ..- attr(*, "label")= chr "Total score for Instrumental Support"
  ..- attr(*, "format.spss")= chr "F8.2"
# Using rfImpute to impute missing values in predictor variables
final_data <- final_data %>%
  mutate(across(where(is.character), as.factor))
imputed_data <- rfImpute(cohort ~ ., data = final_data, iter = 5, ntree = 500)
ntree      OOB      1      2
  500:  12.47%  5.99% 26.05%
ntree      OOB      1      2
  500:  13.45%  6.33% 28.37%
ntree      OOB      1      2
  500:  12.77%  5.88% 27.21%
ntree      OOB      1      2
  500:  12.77%  5.55% 27.91%
ntree      OOB      1      2
  500:  13.90%  6.44% 29.53%
imputed_data <- imputed_data %>% 
  select(-SubjID)

head(imputed_data) 
  cohort DEM002 DEM003            DEM004           DEM005
1      0 Female     35 Married/Partnered       Live alone
2      0 Female     32 Married/Partnered One other person
3      0   Male     28          Divorced       Live alone
4      0 Female     42 Married/Partnered One other person
5      0 Female     27 Married/Partnered One other person
6      0 Female     39 Married/Partnered One other person
                                           DEM006 DEM007
1                                 Grade 6 or less    Yes
2                                    Part college     No
3 Graduated high school or high school equivalent     No
4                       Graduated 2 years college     No
5                       Graduated 4 years college    Yes
6               Part graduate/professional school     No
                         DEM008 pain_severity_sum pain_interf_sum
1         Ca not make ends meet          22.12006        38.05583
2 Have just enough to get along          32.00000        48.00000
3         Ca not make ends meet          25.00000        50.00000
4 Have just enough to get along          12.00000         3.00000
5               Are comfortable          13.00000        15.00000
6 Have just enough to get along          20.00000        39.00000
  dependence_level Anxiety Depression ca_sum  pps_sum suicidality_risk
1         Moderate       1          1     22 50.52258        High Risk
2       Non-smoker       1          1      9 28.97023         Low Risk
3       Non-smoker       1          1     18 37.90648         Low Risk
4       Non-smoker       1          0      5 10.00000         Low Risk
5       Non-smoker       1          1     12 44.00000         Low Risk
6       Non-smoker       1          1     16 31.35197         Low Risk
    alcohol    SIDSSI SSDSSI   ISDSSI
1 0.0000000  4.523532 15.509  8.06337
2 0.4548799  5.000000 13.000  8.00000
3 0.3705730  4.000000  9.000 10.00000
4 0.4644983 10.000000 21.000  6.00000
5 0.4957024  3.000000 21.000 12.00000
6 0.4520621  4.000000 17.000  9.00000
colSums(is.na(imputed_data))
           cohort            DEM002            DEM003            DEM004 
                0                 0                 0                 0 
           DEM005            DEM006            DEM007            DEM008 
                0                 0                 0                 0 
pain_severity_sum   pain_interf_sum  dependence_level           Anxiety 
                0                 0                 0                 0 
       Depression            ca_sum           pps_sum  suicidality_risk 
                0                 0                 0                 0 
          alcohol            SIDSSI            SSDSSI            ISDSSI 
                0                 0                 0                 0 
summary(imputed_data)
 cohort                           DEM002        DEM003     
 0:901   (X)Subject refused to answer:  0   Min.   : 2.00  
 1:430   Male                        :530   1st Qu.:23.00  
         Female                      :801   Median :35.00  
                                            Mean   :33.92  
                                            3rd Qu.:44.00  
                                            Max.   :69.00  
                                                           
                          DEM004                           DEM005   
 (X)Subject refused to answer:  0   Live alone                :229  
 Married/Partnered           :637   One other person          :523  
 Separated                   : 61   More than one other person:579  
 Divorced                    :231                                   
 Never married               :355                                   
 Widowed                     : 47                                   
                                                                    
                                             DEM006    DEM007   
 Graduated high school or high school equivalent:350   Yes:350  
 Part college                                   :336   No :981  
 Graduated 4 years college                      :173            
 Graduated 2 years college                      :149            
 Completed graduate/professional school         :141            
 Grade 7 to 12                                  :126            
 (Other)                                        : 56            
                           DEM008    pain_severity_sum pain_interf_sum
 Ca not make ends meet        :392   Min.   : 0.00     Min.   : 0.00  
 Have just enough to get along:595   1st Qu.:16.00     1st Qu.:23.00  
 Are comfortable              :344   Median :21.00     Median :34.00  
                                     Mean   :20.85     Mean   :32.73  
                                     3rd Qu.:26.00     3rd Qu.:44.00  
                                     Max.   :40.00     Max.   :60.00  
                                                                      
        dependence_level Anxiety Depression     ca_sum         pps_sum     
 High           : 67     0:652   0:708      Min.   : 0.00   Min.   : 0.00  
 Low            :106     1:679   1:623      1st Qu.:10.00   1st Qu.:18.00  
 Low to Moderate:123                        Median :15.00   Median :30.21  
 Moderate       :199                        Mean   :15.42   Mean   :35.23  
 Non-smoker     :836                        3rd Qu.:21.00   3rd Qu.:52.86  
                                            Max.   :36.00   Max.   :96.00  
                                                                           
  suicidality_risk    alcohol           SIDSSI           SSDSSI     
 High Risk: 321    Min.   :0.0000   Min.   : 0.000   Min.   : 7.00  
 Low Risk :1010    1st Qu.:0.0000   1st Qu.: 4.000   1st Qu.:15.00  
                   Median :0.0000   Median : 5.000   Median :18.00  
                   Mean   :0.3908   Mean   : 5.155   Mean   :17.27  
                   3rd Qu.:1.0000   3rd Qu.: 7.000   3rd Qu.:20.00  
                   Max.   :1.0000   Max.   :13.000   Max.   :21.00  
                                                                    
     ISDSSI      
 Min.   : 0.000  
 1st Qu.: 7.000  
 Median : 9.000  
 Mean   : 8.483  
 3rd Qu.:11.000  
 Max.   :12.000  
                 

2.5.2 Data Splitting and Cross-Validation

library(rsample)

set.seed(1234)
mydata_split <- initial_split(imputed_data,
                            strata = cohort,# maintaining same percentage of cases/controls
                            prop = 0.80)
mydata_training <- training(mydata_split)
mydata_test <- testing(mydata_split)

#10-fold cross-validation: given the small sample size and that no parameter tuning is performed, the whole dataset is used rather than setting aside a separate held-out set.
set.seed(1234)
mydata_folds <- vfold_cv(imputed_data, v = 10)
mydata_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [1197/134]> Fold01
 2 <split [1198/133]> Fold02
 3 <split [1198/133]> Fold03
 4 <split [1198/133]> Fold04
 5 <split [1198/133]> Fold05
 6 <split [1198/133]> Fold06
 7 <split [1198/133]> Fold07
 8 <split [1198/133]> Fold08
 9 <split [1198/133]> Fold09
10 <split [1198/133]> Fold10
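These folds can then feed tune::fit_resamples() to obtain cross-validated performance estimates; a sketch (the metric choice here is an assumption):

```r
library(tune)
library(yardstick)

set.seed(1234)
cv_results <- fit_resamples(
  logistic_reg() |> set_engine("glm"),
  cohort ~ .,
  resamples = mydata_folds,
  metrics = metric_set(roc_auc, accuracy)
)
collect_metrics(cv_results)  # mean and SE of each metric across the 10 folds
```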

2.5.3. Model Selection With Missing Data Imputation

To determine whether the imputation was performed properly without unduly changing model performance, I will compare two models: one with all variables included and the other excluding these two variables. Comparing against a complete-case model was another option, but because the sample sizes would differ too much between models, I chose this comparison instead. Along with the AUC-ROC value, the false negative count (which carries greater clinical importance) will also be considered, and the model with the better false negative performance will be selected. Significant predictors in the two models will also be compared.
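Since false negatives drive this comparison, it helps to pull that cell out of the yardstick confusion matrix programmatically. A hypothetical helper (assuming prediction tibbles shaped like the ones built below, with columns truth and .pred_class, and 1 coding OUD):

```r
library(yardstick)

# False negatives: true OUD cases (truth == 1) predicted as non-OUD (0)
fn_count <- function(pred_df) {
  cm <- conf_mat(pred_df, truth = truth, estimate = .pred_class)
  cm$table["0", "1"]   # rows = Prediction, columns = Truth
}
```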

Model excluding the two variables with significant missingness

This model can be compared with the model including these two variables in 2.5.4.

imputed_data_drop <- imputed_data %>%
  select(-pps_sum, -suicidality_risk)

set.seed(1234)
mydata_split_drop <- initial_split(imputed_data_drop,
                            strata = cohort,#maintaining same percentage of cases/controls
                            prop = 0.80)
mydata_training_drop <- training(mydata_split_drop)
mydata_test_drop <- testing(mydata_split_drop)

mydata_glm_drop <- glm (cohort ~., data = mydata_training_drop, family = binomial(logit))
summary(mydata_glm_drop)

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = mydata_training_drop)

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                            4.871759   1.488825
DEM002Female                                          -0.873050   0.200291
DEM003                                                -0.065588   0.009382
DEM004Separated                                        0.931988   0.428391
DEM004Divorced                                         0.855561   0.271467
DEM004Never married                                    0.983038   0.259283
DEM004Widowed                                          0.610713   0.660946
DEM005One other person                                 0.232893   0.302483
DEM005More than one other person                       0.470502   0.301568
DEM006Grade 7 to 12                                    0.033994   1.186643
DEM006Graduated high school or high school equivalent -0.432911   1.159214
DEM006Part college                                    -0.866326   1.163105
DEM006Graduated 2 years college                       -1.314249   1.184582
DEM006Graduated 4 years college                       -0.968168   1.183993
DEM006Part graduate/professional school               -2.035691   1.356486
DEM006Completed graduate/professional school          -0.668029   1.202882
DEM007No                                              -0.221128   0.232128
DEM008Have just enough to get along                   -0.441290   0.215569
DEM008Are comfortable                                 -0.903898   0.328066
pain_severity_sum                                     -0.062049   0.017074
pain_interf_sum                                       -0.006896   0.009713
dependence_levelLow                                   -0.257552   0.475033
dependence_levelLow to Moderate                        0.534342   0.471202
dependence_levelModerate                               0.410330   0.436979
dependence_levelNon-smoker                            -1.350814   0.410637
Anxiety1                                              -0.121497   0.244069
Depression1                                            0.147940   0.247708
ca_sum                                                 0.043644   0.015636
alcohol                                               -0.715496   0.212122
SIDSSI                                                 0.144223   0.046983
SSDSSI                                                -0.066514   0.035852
ISDSSI                                                -0.067205   0.037788
                                                      z value Pr(>|z|)    
(Intercept)                                             3.272 0.001067 ** 
DEM002Female                                           -4.359 1.31e-05 ***
DEM003                                                 -6.991 2.74e-12 ***
DEM004Separated                                         2.176 0.029589 *  
DEM004Divorced                                          3.152 0.001624 ** 
DEM004Never married                                     3.791 0.000150 ***
DEM004Widowed                                           0.924 0.355488    
DEM005One other person                                  0.770 0.441336    
DEM005More than one other person                        1.560 0.118716    
DEM006Grade 7 to 12                                     0.029 0.977146    
DEM006Graduated high school or high school equivalent  -0.373 0.708812    
DEM006Part college                                     -0.745 0.456369    
DEM006Graduated 2 years college                        -1.109 0.267231    
DEM006Graduated 4 years college                        -0.818 0.413520    
DEM006Part graduate/professional school                -1.501 0.133431    
DEM006Completed graduate/professional school           -0.555 0.578651    
DEM007No                                               -0.953 0.340787    
DEM008Have just enough to get along                    -2.047 0.040648 *  
DEM008Are comfortable                                  -2.755 0.005865 ** 
pain_severity_sum                                      -3.634 0.000279 ***
pain_interf_sum                                        -0.710 0.477695    
dependence_levelLow                                    -0.542 0.587696    
dependence_levelLow to Moderate                         1.134 0.256796    
dependence_levelModerate                                0.939 0.347722    
dependence_levelNon-smoker                             -3.290 0.001003 ** 
Anxiety1                                               -0.498 0.618627    
Depression1                                             0.597 0.550349    
ca_sum                                                  2.791 0.005249 ** 
alcohol                                                -3.373 0.000743 ***
SIDSSI                                                  3.070 0.002143 ** 
SSDSSI                                                 -1.855 0.063562 .  
ISDSSI                                                 -1.779 0.075322 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1339.23  on 1063  degrees of freedom
Residual deviance:  731.74  on 1032  degrees of freedom
AIC: 795.74

Number of Fisher Scoring iterations: 6
vif(mydata_glm_drop)
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.117894  1        1.057305
DEM003            1.416818  1        1.190301
DEM004            1.637117  4        1.063555
DEM005            1.454338  2        1.098162
DEM006            1.631675  7        1.035591
DEM007            1.206075  1        1.098214
DEM008            1.376089  2        1.083083
pain_severity_sum 1.830443  1        1.352939
pain_interf_sum   2.269812  1        1.506590
dependence_level  1.317632  4        1.035081
Anxiety           1.638869  1        1.280183
Depression        1.734281  1        1.316921
ca_sum            1.656417  1        1.287019
alcohol           1.101571  1        1.049558
SIDSSI            1.245237  1        1.115902
SSDSSI            1.816772  1        1.347877
ISDSSI            1.441135  1        1.200473
#Model building
lr_cls_spec_2 <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit 
lr_cls_fit_2 <- 
  lr_cls_spec_2 |>
  fit(cohort ~ ., data = mydata_training_drop)

#Prediction using the "TEST" dataset
mydata.lr.pred.values.test_2 <-  bind_cols(
  truth = mydata_test_drop$cohort,
  predict(lr_cls_fit_2, mydata_test_drop),
  predict(lr_cls_fit_2, mydata_test_drop, type = "prob"))

mydata.lr.pred.values.test_2
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.963 0.0374 
 2 0     0             0.910 0.0895 
 3 0     0             0.612 0.388  
 4 0     0             0.939 0.0609 
 5 0     0             0.943 0.0570 
 6 0     0             0.869 0.131  
 7 0     0             0.954 0.0464 
 8 1     1             0.412 0.588  
 9 0     0             0.991 0.00917
10 0     0             0.740 0.260  
# ℹ 257 more rows
#ROC plots for testing data
autoplot(roc_curve(mydata.lr.pred.values.test_2, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(mydata.lr.pred.values.test_2, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.820
2 kap         binary         0.581
3 mn_log_loss binary         0.395
4 roc_auc     binary         0.883
conf_matrix_without <- mydata.lr.pred.values.test_2 %>%
  conf_mat(truth = truth, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrix_without)
          Truth
Prediction   0   1
         0 160  27
         1  21  59

Result: Given the similarity in model performance, the model that includes the two variables is preferable, as these variables hold clinical significance. More importantly, the imputation process does not appear to negatively affect the model’s performance. In particular, the model with the two variables showed a slightly higher AUC-ROC value and fewer false negative cases (exclusion model n=29 vs. inclusion model n=28). Clinically, minimizing false negatives is crucial in a predictive model. Therefore, further development of the algorithm will proceed using the imputed data with these two variables included.

2.5.4. Logistic regression

glm (Training/Test)

#general model, fit on the whole dataset
forwholeglm <- final_data %>% 
  select(-SubjID)
whole_glm <- glm(cohort ~., data = forwholeglm, family=binomial(logit))
summary(whole_glm)

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = forwholeglm)

Coefficients:
                                                        Estimate Std. Error
(Intercept)                                            6.305e+00  2.434e+00
DEM002Female                                          -8.908e-01  2.561e-01
DEM003                                                -6.194e-02  1.246e-02
DEM004Separated                                        5.731e-01  6.136e-01
DEM004Divorced                                         3.985e-02  3.799e-01
DEM004Never married                                    4.745e-01  3.256e-01
DEM004Widowed                                         -2.376e-01  7.551e-01
DEM005One other person                                -4.703e-01  4.046e-01
DEM005More than one other person                      -2.187e-01  4.111e-01
DEM006Grade 7 to 12                                    1.349e-01  2.160e+00
DEM006Graduated high school or high school equivalent -5.028e-02  2.139e+00
DEM006Part college                                    -4.823e-01  2.138e+00
DEM006Graduated 2 years college                       -8.433e-01  2.155e+00
DEM006Graduated 4 years college                        1.376e-01  2.149e+00
DEM006Part graduate/professional school               -6.214e-01  2.266e+00
DEM006Completed graduate/professional school          -2.230e-01  2.160e+00
DEM007No                                               1.324e-01  3.052e-01
DEM008Have just enough to get along                   -9.866e-02  2.796e-01
DEM008Are comfortable                                 -1.135e+00  4.317e-01
pain_severity_sum                                     -2.457e-02  2.214e-02
pain_interf_sum                                       -2.181e-02  1.265e-02
dependence_levelLow                                   -7.406e-01  5.965e-01
dependence_levelLow to Moderate                        1.912e-01  5.847e-01
dependence_levelModerate                               3.932e-01  5.426e-01
dependence_levelNon-smoker                            -1.420e+00  5.005e-01
Anxiety1                                               2.331e-01  3.254e-01
Depression1                                            1.413e-01  3.393e-01
ca_sum                                                 3.054e-02  2.403e-02
pps_sum                                               -9.971e-05  6.625e-03
suicidality_riskLow Risk                              -6.244e-01  3.303e-01
alcohol                                               -6.094e-01  2.638e-01
SIDSSI                                                 1.692e-01  5.788e-02
SSDSSI                                                -1.355e-01  4.488e-02
ISDSSI                                                -5.281e-02  4.639e-02
                                                      z value Pr(>|z|)    
(Intercept)                                             2.590 0.009601 ** 
DEM002Female                                           -3.479 0.000504 ***
DEM003                                                 -4.971 6.66e-07 ***
DEM004Separated                                         0.934 0.350314    
DEM004Divorced                                          0.105 0.916452    
DEM004Never married                                     1.457 0.144996    
DEM004Widowed                                          -0.315 0.752988    
DEM005One other person                                 -1.162 0.245090    
DEM005More than one other person                       -0.532 0.594660    
DEM006Grade 7 to 12                                     0.062 0.950198    
DEM006Graduated high school or high school equivalent  -0.024 0.981249    
DEM006Part college                                     -0.226 0.821511    
DEM006Graduated 2 years college                        -0.391 0.695617    
DEM006Graduated 4 years college                         0.064 0.948936    
DEM006Part graduate/professional school                -0.274 0.783887    
DEM006Completed graduate/professional school           -0.103 0.917754    
DEM007No                                                0.434 0.664494    
DEM008Have just enough to get along                    -0.353 0.724211    
DEM008Are comfortable                                  -2.628 0.008584 ** 
pain_severity_sum                                      -1.110 0.267008    
pain_interf_sum                                        -1.724 0.084781 .  
dependence_levelLow                                    -1.242 0.214378    
dependence_levelLow to Moderate                         0.327 0.743639    
dependence_levelModerate                                0.725 0.468598    
dependence_levelNon-smoker                             -2.837 0.004547 ** 
Anxiety1                                                0.716 0.473818    
Depression1                                             0.416 0.677142    
ca_sum                                                  1.271 0.203667    
pps_sum                                                -0.015 0.987992    
suicidality_riskLow Risk                               -1.890 0.058735 .  
alcohol                                                -2.310 0.020901 *  
SIDSSI                                                  2.924 0.003460 ** 
SSDSSI                                                 -3.020 0.002530 ** 
ISDSSI                                                 -1.138 0.254915    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 785.04  on 607  degrees of freedom
Residual deviance: 445.80  on 574  degrees of freedom
  (723 observations deleted due to missingness)
AIC: 513.8

Number of Fisher Scoring iterations: 6
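One way to summarize the deviance output above is a McFadden pseudo-R², computed from the null and residual deviances (a rough fit summary added for context, not part of the original output):

```r
# McFadden pseudo-R^2 from the deviances reported in the summary above.
null_dev  <- 785.04
resid_dev <- 445.80
mcfadden  <- 1 - resid_dev / null_dev
round(mcfadden, 3)  # ~0.43, a reasonably strong fit for a logistic model
```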
vif(whole_glm)
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.126734  1        1.061477
DEM003            1.615394  1        1.270981
DEM004            2.071490  4        1.095306
DEM005            1.634523  2        1.130701
DEM006            1.912659  7        1.047411
DEM007            1.283880  1        1.133084
DEM008            1.464056  2        1.099992
pain_severity_sum 1.850743  1        1.360420
pain_interf_sum   2.499721  1        1.581051
dependence_level  1.511706  4        1.053012
Anxiety           1.770504  1        1.330603
Depression        2.010602  1        1.417957
ca_sum            2.334839  1        1.528018
pps_sum           2.329359  1        1.526224
suicidality_risk  1.333993  1        1.154986
alcohol           1.167542  1        1.080528
SIDSSI            1.317627  1        1.147879
SSDSSI            1.745899  1        1.321325
ISDSSI            1.477109  1        1.215364
mydata_glm <- glm(cohort ~ ., data = mydata_training, family = binomial(logit))
summary(mydata_glm) #sig: pain_severity, smoking low, smoking non-smoker

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = mydata_training)

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                            4.840508   1.593864
DEM002Female                                          -0.897447   0.202720
DEM003                                                -0.063197   0.009507
DEM004Separated                                        0.803917   0.442565
DEM004Divorced                                         0.798697   0.274568
DEM004Never married                                    0.919259   0.263566
DEM004Widowed                                          0.419838   0.698572
DEM005One other person                                 0.195089   0.306673
DEM005More than one other person                       0.412557   0.306145
DEM006Grade 7 to 12                                    0.081645   1.305488
DEM006Graduated high school or high school equivalent -0.322409   1.279733
DEM006Part college                                    -0.803518   1.283370
DEM006Graduated 2 years college                       -1.155847   1.304803
DEM006Graduated 4 years college                       -0.942509   1.302355
DEM006Part graduate/professional school               -1.772972   1.468918
DEM006Completed graduate/professional school          -0.587692   1.316514
DEM007No                                              -0.251986   0.236112
DEM008Have just enough to get along                   -0.334782   0.221405
DEM008Are comfortable                                 -0.764345   0.334768
pain_severity_sum                                     -0.054509   0.017314
pain_interf_sum                                       -0.008652   0.009992
dependence_levelLow                                   -0.351772   0.481097
dependence_levelLow to Moderate                        0.505829   0.477444
dependence_levelModerate                               0.352186   0.443116
dependence_levelNon-smoker                            -1.277211   0.414882
Anxiety1                                              -0.173318   0.249623
Depression1                                            0.082587   0.257324
ca_sum                                                 0.025464   0.016946
pps_sum                                                0.008006   0.005851
suicidality_riskLow Risk                              -0.715353   0.251040
alcohol                                               -0.744001   0.215548
SIDSSI                                                 0.157207   0.047737
SSDSSI                                                -0.048672   0.036580
ISDSSI                                                -0.064984   0.038071
                                                      z value Pr(>|z|)    
(Intercept)                                             3.037 0.002390 ** 
DEM002Female                                           -4.427 9.55e-06 ***
DEM003                                                 -6.648 2.98e-11 ***
DEM004Separated                                         1.816 0.069294 .  
DEM004Divorced                                          2.909 0.003627 ** 
DEM004Never married                                     3.488 0.000487 ***
DEM004Widowed                                           0.601 0.547843    
DEM005One other person                                  0.636 0.524681    
DEM005More than one other person                        1.348 0.177791    
DEM006Grade 7 to 12                                     0.063 0.950133    
DEM006Graduated high school or high school equivalent  -0.252 0.801092    
DEM006Part college                                     -0.626 0.531249    
DEM006Graduated 2 years college                        -0.886 0.375704    
DEM006Graduated 4 years college                        -0.724 0.469252    
DEM006Part graduate/professional school                -1.207 0.227435    
DEM006Completed graduate/professional school           -0.446 0.655308    
DEM007No                                               -1.067 0.285867    
DEM008Have just enough to get along                    -1.512 0.130513    
DEM008Are comfortable                                  -2.283 0.022418 *  
pain_severity_sum                                      -3.148 0.001642 ** 
pain_interf_sum                                        -0.866 0.386512    
dependence_levelLow                                    -0.731 0.464665    
dependence_levelLow to Moderate                         1.059 0.289394    
dependence_levelModerate                                0.795 0.426733    
dependence_levelNon-smoker                             -3.078 0.002081 ** 
Anxiety1                                               -0.694 0.487482    
Depression1                                             0.321 0.748252    
ca_sum                                                  1.503 0.132942    
pps_sum                                                 1.368 0.171209    
suicidality_riskLow Risk                               -2.850 0.004378 ** 
alcohol                                                -3.452 0.000557 ***
SIDSSI                                                  3.293 0.000991 ***
SSDSSI                                                 -1.331 0.183333    
ISDSSI                                                 -1.707 0.087840 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1339.23  on 1063  degrees of freedom
Residual deviance:  717.63  on 1030  degrees of freedom
AIC: 785.63

Number of Fisher Scoring iterations: 6
vif(mydata_glm) #VIF all less than 5; no significant collinearity was found among variables
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.119756  1        1.058185
DEM003            1.435164  1        1.197983
DEM004            1.667609  4        1.066011
DEM005            1.461300  2        1.099474
DEM006            1.675409  7        1.037549
DEM007            1.215990  1        1.102720
DEM008            1.425287  2        1.092636
pain_severity_sum 1.837685  1        1.355612
pain_interf_sum   2.349596  1        1.532839
dependence_level  1.370704  4        1.040203
Anxiety           1.681156  1        1.296594
Depression        1.827836  1        1.351975
ca_sum            1.914646  1        1.383707
pps_sum           2.189020  1        1.479534
suicidality_risk  1.447052  1        1.202935
alcohol           1.113369  1        1.055163
SIDSSI            1.269666  1        1.126795
SSDSSI            1.851173  1        1.360578
ISDSSI            1.452914  1        1.205369
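For predictors with more than one degree of freedom, the conventional VIF < 5 rule of thumb applies to GVIF^(1/(2*Df)) compared against sqrt(5). A quick base-R check on a few of the values above (numbers copied from the vif() table, so this is purely illustrative):

```r
# Compare GVIF^(1/(2*Df)) against sqrt(5), the multi-df analogue of the
# conventional VIF < 5 cutoff (values copied from the vif() table above).
gvif_adj <- c(DEM004 = 1.066011, DEM006 = 1.037549,
              dependence_level = 1.040203, pain_interf_sum = 1.532839)
all(gvif_adj < sqrt(5))  # TRUE: no concerning collinearity
```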
#Model building
lr_cls_spec <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit to training dataset
lr_cls_fit <- 
  lr_cls_spec |>
  fit(cohort ~ ., data = mydata_training)

#Prediction using the "testing" dataset
mydata.lr.pred.values.test <-  bind_cols(
  truth = mydata_test$cohort,
  predict(lr_cls_fit, mydata_test),
  predict(lr_cls_fit, mydata_test, type = "prob")
)
mydata.lr.pred.values.test
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.962  0.0378
 2 0     0             0.879  0.121 
 3 0     0             0.732  0.268 
 4 0     0             0.931  0.0688
 5 0     0             0.950  0.0505
 6 0     0             0.885  0.115 
 7 0     0             0.953  0.0474
 8 1     1             0.247  0.753 
 9 0     0             0.988  0.0117
10 0     0             0.783  0.217 
# ℹ 257 more rows
#ROC plots for "testing" dataset
autoplot(roc_curve(mydata.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on "testing" dataset
metrics(mydata.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.824
2 kap         binary         0.591
3 mn_log_loss binary         0.396
4 roc_auc     binary         0.884
conf_matrix_all <- mydata.lr.pred.values.test %>%
  conf_mat(truth = truth, estimate = .pred_class)
print(conf_matrix_all)
          Truth
Prediction   0   1
         0 160  26
         1  21  60

Logistic Regression - Cross Validation (10-fold)

#model building
lr_cls_spec_cv <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit to the FULL dataset (for reference only; fit_resamples() below
#refits the model within each fold)
lr_cls_fit_cv <- 
  lr_cls_spec_cv |>
  fit(cohort ~ ., data = imputed_data)

#Create a workflow() for fitting the glm
glm_wf <- workflow() |>
  add_model(lr_cls_spec_cv) |> 
  add_formula(cohort ~ .)
  
#Cross validation fitting
glm_fit_cv <- 
  glm_wf |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

#Collect predictions out of folds into one tibble
mydata_glm_cv_preds <- collect_predictions(glm_fit_cv)

#Plot of ROC curve of CV results
autoplot(roc_curve(mydata_glm_cv_preds, 
        cohort, 
        .pred_0))

#To see performance of each fold
mydata_glm_cv_preds |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.920
 2 Fold02 roc_auc binary         0.902
 3 Fold03 roc_auc binary         0.912
 4 Fold04 roc_auc binary         0.929
 5 Fold05 roc_auc binary         0.874
 6 Fold06 roc_auc binary         0.893
 7 Fold07 roc_auc binary         0.901
 8 Fold08 roc_auc binary         0.878
 9 Fold09 roc_auc binary         0.880
10 Fold10 roc_auc binary         0.860
#Show ROC curves for all folds
mydata_glm_cv_preds |>
  group_by(id) |>
  roc_curve(cohort, .pred_0) |>
  autoplot()

#Overall metrics
collect_metrics(glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.841    10 0.00783 Preprocessor1_Model1
2 brier_class binary     0.116    10 0.00470 Preprocessor1_Model1
3 roc_auc     binary     0.895    10 0.00691 Preprocessor1_Model1
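The fold-to-fold spread can be summarized with a rough interval around the cross-validated AUC (normal approximation using the mean and std_err reported by collect_metrics() above; an informal sketch added for context):

```r
# Approximate 95% CI for the cross-validated AUC, using the mean and
# standard error reported by collect_metrics() above.
auc_mean <- 0.895
auc_se   <- 0.00691
ci <- auc_mean + c(-1, 1) * qnorm(0.975) * auc_se
round(ci, 3)  # roughly 0.881 to 0.909
```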

Confusion matrix for lr

# Convert probabilities to binary predictions using a threshold
mydata_glm_cv_preds <- mydata_glm_cv_preds |> 
  mutate(
    predicted = ifelse(.pred_1 > 0.5, 1, 0),               # Binary predictions
    predicted = factor(predicted, levels = c(0, 1)),       # Convert to factor
    cohort = factor(cohort, levels = c(0, 1))              # Ensure cohort is a factor
  )

# Calculate the confusion matrix
conf_mat_lr <- conf_mat(mydata_glm_cv_preds, truth = cohort, estimate = predicted)
print(conf_mat_lr)
          Truth
Prediction   0   1
         0 817 127
         1  84 303
# Accuracy
accuracy_lr <- accuracy(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# Precision
precision_lr<- precision(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall_lr <- recall(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# F1-Score
f1_lr <- f_meas(mydata_glm_cv_preds, truth = cohort, estimate = predicted, beta = 1)

# Display metrics
metrics_lr <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy_lr$.estimate, precision_lr$.estimate, recall_lr$.estimate, f1_lr$.estimate)
)
print(metrics_lr)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.841
2 Precision 0.865
3 Recall    0.907
4 F1-Score  0.886
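One caveat worth flagging (an observation about yardstick defaults, not part of the original analysis): with the default event_level = "first", the precision, recall, and F1 above are computed for class 0, the first factor level (e.g., Recall 0.907 = 817/901). Recomputing them for the clinically relevant OUD class (1) from the confusion matrix printed above:

```r
# Precision and recall for class 1 (OUD), read off the CV confusion
# matrix above; yardstick's defaults reported these for class 0 instead.
tp <- 303; fn <- 127; fp <- 84; tn <- 817
recall_oud    <- tp / (tp + fn)  # OUD cases correctly flagged
precision_oud <- tp / (tp + fp)  # flagged cases that are truly OUD
round(c(recall = recall_oud, precision = precision_oud), 3)
```

By this reading the model misses about 30% of OUD cases at the default 0.5 cutoff, which is exactly the error the results discussion identifies as the costly one; passing event_level = "second" to the yardstick metrics (or lowering the threshold) makes this explicit.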

2.5.5. Random forest

Testing/Training

library(randomForest)
library(vip)

Attaching package: 'vip'
The following object is masked from 'package:utils':

    vi
#Model specification/building
rf_spec <- 
  rand_forest(trees = 300, min_n = 5) |> #Min number of data points in node for further split
  set_engine("randomForest", importance = TRUE) |>
  set_mode("classification")

#Model fit
rf_fit <- rf_spec |>
  fit(cohort ~ ., data = mydata_training)
rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~300, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 300
No. of variables tried at each split: 4

        OOB estimate of  error rate: 14.66%
Confusion matrix:
    0   1 class.error
0 670  50  0.06944444
1 106 238  0.30813953
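As a sanity check, the reported OOB error rate follows directly from the OOB confusion matrix (counts copied from the output above):

```r
# Recompute the OOB error rate from the confusion-matrix counts above.
correct   <- 670 + 238
incorrect <- 50 + 106
oob_error <- incorrect / (correct + incorrect)
round(100 * oob_error, 2)  # 14.66, matching the reported rate
```

Note also the asymmetric per-class errors above (~7% for class 0 vs. ~31% for class 1), which reflect the 720/344 class imbalance in the training data.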
#Prediction using the "TEST" dataset
rf_predicted <- bind_cols(
  truth = mydata_test$cohort,
  predict(rf_fit, mydata_test),
  predict(rf_fit, mydata_test, type = "prob")
)

rf_predicted
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.847  0.153 
 2 0     0             0.95   0.05  
 3 0     0             0.583  0.417 
 4 0     0             0.913  0.0867
 5 0     0             0.947  0.0533
 6 0     0             0.9    0.1   
 7 0     0             0.93   0.07  
 8 1     1             0.293  0.707 
 9 0     0             0.99   0.01  
10 0     0             0.84   0.16  
# ℹ 257 more rows
#ROC AUC and curve for testing data
roc_auc(rf_predicted, truth, .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.876
autoplot(roc_curve(rf_predicted, 
                   truth, 
                   .pred_0))

Random Forest Cross Validation

#I will then use cross-validation to re-estimate the predictive ability more fairly, using the same folds created for the glm model above ("mydata_folds")

#workflow setup
rf_workflow <-
  workflow() |>
  add_model(rf_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)

#Use workflow to fit model with each fold of re-sampled data
rf_fit_cv <-
  rf_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

#one ROC plot for cross validation datasets
rf_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0)|>
  autoplot()

# Collect predictions for each fold
cv_predictions <- rf_fit_cv |> collect_predictions()

# To see the performance of each fold using ROC AUC
cv_predictions |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.918
 2 Fold02 roc_auc binary         0.911
 3 Fold03 roc_auc binary         0.909
 4 Fold04 roc_auc binary         0.935
 5 Fold05 roc_auc binary         0.890
 6 Fold06 roc_auc binary         0.906
 7 Fold07 roc_auc binary         0.905
 8 Fold08 roc_auc binary         0.879
 9 Fold09 roc_auc binary         0.877
10 Fold10 roc_auc binary         0.880
#calculating overall roc value
overall_roc_data <- cv_predictions |>
  roc_curve(cohort, .pred_0)

#Overall metrics
collect_metrics(rf_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.853    10 0.00935 Preprocessor1_Model1
2 brier_class binary     0.114    10 0.00370 Preprocessor1_Model1
3 roc_auc     binary     0.901    10 0.00605 Preprocessor1_Model1

Creating 10-fold CV ROC curves for visualization

# Calculate ROC data for each fold
fold_roc_data <- cv_predictions |>
  group_by(id) |>
  roc_curve(cohort, .pred_0)

# Calculate AUC for each fold
fold_auc <- cv_predictions |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
fold_auc
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.918
 2 Fold02 roc_auc binary         0.911
 3 Fold03 roc_auc binary         0.909
 4 Fold04 roc_auc binary         0.935
 5 Fold05 roc_auc binary         0.890
 6 Fold06 roc_auc binary         0.906
 7 Fold07 roc_auc binary         0.905
 8 Fold08 roc_auc binary         0.879
 9 Fold09 roc_auc binary         0.877
10 Fold10 roc_auc binary         0.880
# Calculate the overall ROC curve
overall_roc_data <- cv_predictions |>
  roc_curve(cohort, .pred_0)


# Create labels with AUC for each fold
fold_auc_labels <- fold_auc |>
  mutate(label = paste0(id, " (AUC = ", round(.estimate, 3), ")"))

# Update the fold IDs in cv_predictions with the new labels
cv_predictions <- cv_predictions |>
  left_join(fold_auc_labels, by = "id") |>
  mutate(id = label)

# Plot ROC curves for each fold with AUC in legend and overlay overall ROC curve
cv_predictions |>
  group_by(id) |>
  roc_curve(cohort, .pred_0) |>
  autoplot() +
  geom_line(data = overall_roc_data, aes(x = 1 - specificity, y = sensitivity), 
            color = "black", linewidth = 0.5) +
  labs(
    title = "ROC Curves for Each Fold with Overall ROC Curve",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Fold (AUC)"
  ) +
  theme_minimal()

#To understand the variables that most contribute to the classification, I extract the importance scores using the vip package.
rf_fit |>
  extract_fit_engine() |>
  importance()
                           0          1 MeanDecreaseAccuracy MeanDecreaseGini
DEM002             5.7291581  5.9777965            7.8654226         6.362394
DEM003            19.1952060 23.8226091           29.0141162        60.119084
DEM004             9.2983379 10.6498381           13.9684480        27.446527
DEM005             4.1256282  1.6448577            4.4086939         7.333858
DEM006             2.0939078 10.0538045            8.5746361        25.900134
DEM007             0.3769183 -1.5213245           -0.6183887         2.314975
DEM008             6.7439375  8.4989072           10.9169662        13.198006
pain_severity_sum  8.4481774  7.1418239           11.0224364        24.311989
pain_interf_sum    8.5052318  2.6693844            8.5106861        21.307665
dependence_level  20.4207191 26.2332374           29.0577811        68.352529
Anxiety            4.8683562  1.3361159            4.6229029         2.435441
Depression         4.8292654 -2.3422671            2.8489566         2.804627
ca_sum            10.3198167  1.2479925            9.4561863        20.622516
pps_sum           13.3058760  8.6189580           16.5164370        36.690803
suicidality_risk  12.3998941  8.1555039           14.2565995        25.210713
alcohol            4.6949991  4.0984950            6.2190820         6.340665
SIDSSI             1.7624689  0.4920929            1.8053706        13.877814
SSDSSI             9.0311470  8.8281867           12.3053492        23.608097
ISDSSI             1.3976989  2.9865378            2.8197335        14.169062
rf_fit |>
  extract_fit_engine() |>
  vip()

Confusion matrix and metric calculation

# Ensure that both 'cohort' (truth) and 'predicted' (estimate) are factors
cv_predictions <- cv_predictions |> 
  mutate(predicted = ifelse(.pred_1 > 0.5, 1, 0))

cv_predictions <- cv_predictions |>
  mutate(
    predicted = factor(predicted, levels = c(0, 1)),  # Convert predicted to a factor
    cohort = factor(cohort, levels = c(0, 1))         # Convert cohort to a factor
  )
# Confusion matrix (stored under a distinct name so it does not mask
# yardstick's conf_mat() function)
conf_mat_rf <- cv_predictions |> 
  conf_mat(truth = cohort, estimate = predicted)
print(conf_mat_rf)
          Truth
Prediction   0   1
         0 837 132
         1  64 298
# Accuracy
accuracy_rf <- accuracy(cv_predictions, truth = cohort, estimate = predicted)

# Precision
precision_rf <- precision(cv_predictions, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall_rf <- recall(cv_predictions, truth = cohort, estimate = predicted)

# F1-Score
f1_rf <- f_meas(cv_predictions, truth = cohort, estimate = predicted, beta = 1)

# Display metrics (suffixed _rf so the yardstick functions accuracy(),
# precision(), recall(), and metrics() are not masked by these objects)
metrics_rf <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy_rf$.estimate, precision_rf$.estimate, recall_rf$.estimate, f1_rf$.estimate)
)
print(metrics_rf)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.853
2 Precision 0.864
3 Recall    0.929
4 F1-Score  0.895

2.5.6. Gradient boosting

Testing/Training

#loading the packages
library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':

    slice
#Model specification/building
bt_spec <- 
  boost_tree(trees = 50,        #Number of trees
             tree_depth = 4) |> #Max depth of tree
  set_mode("classification") |>
  set_engine("xgboost")
bt_spec
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 50
  tree_depth = 4

Computational engine: xgboost 
#Recipe
bt_recipe <-
  recipe(cohort ~ ., data = mydata_training) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())
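For context on the recipe (a base-R illustration, not part of the fitted workflow): xgboost requires numeric inputs, so step_dummy() expands each factor into indicator columns, much as model.matrix() does. The factor name below is invented for illustration:

```r
# What step_dummy() does, illustrated with base R's model.matrix():
# each factor level beyond the reference becomes a 0/1 indicator column.
df <- data.frame(smoke = factor(c("Low", "Moderate", "Non-smoker")))
dummies <- model.matrix(~ smoke, df)[, -1, drop = FALSE]  # drop intercept
colnames(dummies)  # "smokeModerate" "smokeNon-smoker"
```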

#Workflow specification
bt_workflow <- workflow() |>
  add_model(bt_spec) |>
  add_recipe(bt_recipe)

#Model fit with training data
bt_fit <- fit(bt_workflow, data = mydata_training)
bt_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 76.7 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.3, max_depth = 4, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 50, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
  eta = "0.3", max_depth = "4", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 35 
niter: 50
nfeatures : 35 
evaluation_log:
  iter training_logloss
 <num>            <num>
     1       0.55108652
     2       0.46867860
   ---              ---
    49       0.10117315
    50       0.09836642
#Prediction using the "TEST" dataset
my.bt.pred <- bind_cols(
  truth = mydata_test$cohort,
  predict(bt_fit, mydata_test),
  predict(bt_fit, mydata_test, type = "prob"))

#ROC AUC and curve for testing data
roc_auc(my.bt.pred, truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.853
autoplot(roc_curve(my.bt.pred, 
                   truth, 
                   .pred_0))

#Variable importance extracted with vip
vip(bt_fit)

# Variable importance extracted from the underlying xgboost object
# (extract_fit_parsnip() replaces the deprecated pull_workflow_fit())
bt_object <- extract_fit_parsnip(bt_fit)$fit
xgb.importance(model = bt_object)
                                                   Feature         Gain
                                                    <char>        <num>
 1:                            dependence_level_Non.smoker 0.2510084069
 2:                                                 DEM003 0.1532852442
 3:                                      pain_severity_sum 0.0784107892
 4:                                                pps_sum 0.0713751162
 5:                                        pain_interf_sum 0.0523647390
 6:                                                 ISDSSI 0.0521884468
 7:                                                 SSDSSI 0.0517575454
 8:                              suicidality_risk_Low.Risk 0.0452404680
 9:                                                alcohol 0.0369236332
10:                                                 SIDSSI 0.0320754471
Rank  Feature                                                  Gain          Cover         Frequency
11    ca_sum                                                   0.0296924951  0.0498762174  0.056313993
12    DEM004_Married.Partnered                                 0.0255952221  0.0175696064  0.018771331
13    DEM002_Male                                              0.0236950261  0.0241666854  0.020477816
14    DEM006_Grade.7.to.12                                     0.0130559845  0.0141022093  0.011945392
15    DEM004_Never.married                                     0.0119732011  0.0060534054  0.006825939
16    DEM008_Are.comfortable                                   0.0114869505  0.0153128947  0.018771331
17    DEM006_Graduated.high.school.or.high.school.equivalent   0.0099127151  0.0065772194  0.010238908
18    DEM005_More.than.one.other.person                        0.0075289567  0.0065129084  0.022184300
19    DEM005_One.other.person                                  0.0069791150  0.0109529871  0.013651877
20    DEM006_Graduated.2.years.college                         0.0055985746  0.0157159374  0.010238908
21    dependence_level_Low.to.Moderate                         0.0053912399  0.0057150741  0.013651877
22    DEM008_Have.just.enough.to.get.along                     0.0050951249  0.0084954867  0.011945392
23    dependence_level_Moderate                                0.0043642995  0.0033950203  0.006825939
24    DEM006_Completed.graduate.professional.school            0.0030668745  0.0007518971  0.003412969
25    DEM006_Part.graduate.professional.school                 0.0030537959  0.0140618430  0.006825939
26    DEM006_Graduated.4.years.college                         0.0028898602  0.0050361060  0.006825939
27    Anxiety_X1                                               0.0021029296  0.0007979533  0.003412969
28    DEM007_No                                                0.0016761085  0.0061817748  0.005119454
29    dependence_level_Low                                     0.0011335591  0.0025299278  0.001706485
30    DEM006_Part.college                                      0.0007020002  0.0003044934  0.001706485
31    Depression_X1                                            0.0003761308  0.0016890514  0.001706485

Cover and Frequency for ranks 1-10 (the features and Gain values ranked above):

Rank  Cover         Frequency
1     0.0845346355  0.023890785
2     0.1594013890  0.098976109
3     0.0943114637  0.112627986
4     0.1045626851  0.114334471
5     0.0816849477  0.098976109
6     0.0669610130  0.097269625
7     0.0614724768  0.075085324
8     0.0229658261  0.017064846
9     0.0392018251  0.044368601
10    0.0691050393  0.064846416
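The Gain column from xgb.importance is normalized to sum to 1 across all features, so the values can be read directly as fractions of total split gain. As a quick sanity check (a minimal Python sketch using only the Gain values printed above for ranks 11-31), the lower-ranked features together carry roughly 17.5% of total gain, leaving about 82.5% to the top ten:

```python
# Gain values for features ranked 11-31 in the importance table above.
tail_gains = [0.0296924951, 0.0255952221, 0.0236950261, 0.0130559845,
              0.0119732011, 0.0114869505, 0.0099127151, 0.0075289567,
              0.0069791150, 0.0055985746, 0.0053912399, 0.0050951249,
              0.0043642995, 0.0030668745, 0.0030537959, 0.0028898602,
              0.0021029296, 0.0016761085, 0.0011335591, 0.0007020002,
              0.0003761308]

tail_share = sum(tail_gains)       # gain carried by ranks 11-31
top10_share = 1.0 - tail_share     # Gain is normalized to sum to 1
print(f"ranks 11-31: {tail_share:.3f}, ranks 1-10: {top10_share:.3f}")
```

This concentration of gain in the top-ranked features is consistent with the feature importance discussion below.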

Cross Validation - Testing/Training

#workflow setup
bt_workflow <-
  workflow() |>
  add_model(bt_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)
#Use workflow to fit model with each fold of re-sampled data
bt_fit_cv <-
  bt_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))
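The pipeline above is R/tidymodels; conceptually, fit_resamples() fits the workflow once per fold, scoring on the held-out part each time. The fold structure it iterates over can be sketched language-agnostically (illustrative Python with hypothetical sample and fold counts, not the study data):

```python
import random

# Hypothetical sizes: 20 observations split into 5 folds.
random.seed(1234)
n, k = 20, 5
idx = list(range(n))
random.shuffle(idx)

# k disjoint assessment sets; each observation is held out exactly once.
folds = [idx[i::k] for i in range(k)]
for assess in folds:
    train = [j for j in idx if j not in assess]
    # the model would be fit on `train` and scored on `assess` here
    assert set(train) | set(assess) == set(range(n))
print([len(f) for f in folds])
```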

#ROC plots for cross validation datasets
bt_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0) |>
  autoplot()

#Overall metrics
collect_metrics(bt_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.838    10 0.0106  Preprocessor1_Model1
2 brier_class binary     0.123    10 0.00625 Preprocessor1_Model1
3 roc_auc     binary     0.889    10 0.00822 Preprocessor1_Model1
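For each metric, collect_metrics() reports the mean over the 10 folds and the standard error of that mean (the fold-to-fold standard deviation divided by the square root of the number of folds). That aggregation can be sketched in plain Python (the per-fold AUC values below are hypothetical, not the study's):

```python
import statistics

# Hypothetical per-fold ROC AUC values from a 10-fold resampling.
fold_auc = [0.869, 0.902, 0.881, 0.895, 0.910,
            0.874, 0.893, 0.927, 0.886, 0.879]

mean_auc = statistics.mean(fold_auc)
# standard error of the mean: sd over folds / sqrt(number of folds)
std_err = statistics.stdev(fold_auc) / len(fold_auc) ** 0.5
print(f"mean = {mean_auc:.3f}, n = {len(fold_auc)}, std_err = {std_err:.5f}")
```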

Confusion matrix-derived metrics

# Collect predictions from the cross-validation results
bt_predictions <- bt_fit_cv |>
  collect_predictions()

# Convert probabilities to binary predictions
bt_predictions <- bt_predictions |> 
  mutate(
    predicted = ifelse(.pred_1 > 0.5, 1, 0),               # Binary predictions
    predicted = factor(predicted, levels = c(0, 1)),       # Convert to factor
    cohort = factor(cohort, levels = c(0, 1))              # Ensure cohort is a factor
  )

# metrics() returns only accuracy (plus kap, log loss, and AUC) for this input,
# so precision, recall, and F1 are computed individually below
bt_metrics <- bt_predictions |>
  metrics(truth = cohort, estimate = predicted, .pred_1) |>
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas"))

print(bt_metrics)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.838
# Accuracy
accuracy_bt <- accuracy(bt_predictions, truth = cohort, estimate = predicted)

# Precision
precision_bt<- precision(bt_predictions, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall_bt<- recall(bt_predictions, truth = cohort, estimate = predicted)

# F1-Score
f1_bt <- f_meas(bt_predictions, truth = cohort, estimate = predicted, beta = 1)

# Display metrics
metrics_bt <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy_bt$.estimate, precision_bt$.estimate, recall_bt$.estimate, f1_bt$.estimate)
)
print(metrics_bt)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.838
2 Precision 0.860
3 Recall    0.908
4 F1-Score  0.883
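The individual yardstick calls above reduce to simple counts on the thresholded predictions. A minimal, self-contained sketch of the same arithmetic (toy data, not the study's predictions):

```python
# Toy example mirroring the pipeline above: threshold predicted probabilities
# at 0.5, then compute accuracy, precision, recall, and F1 from the counts.
truth  = [1, 1, 1, 0, 0, 1, 0, 1]
prob_1 = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]

pred = [1 if p > 0.5 else 0 for p in prob_1]

tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))

accuracy  = (tp + tn) / len(truth)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Note that yardstick additionally requires a choice of event level (by default the first factor level); the toy above simply treats class 1 as the event.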

2.5.7. Model Performance

Three predictive models were implemented to predict opioid use disorder, and 10-fold cross-validation was conducted to evaluate model performance. All three models achieved satisfactory performance; when comparing cross-validated roc_auc values across the three algorithms, random forest had the highest value.
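As an aside on what roc_auc measures: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. This rank-based formulation can be computed directly (toy scores below, not the study data):

```python
# Toy labels and predicted scores; class 1 is the event of interest.
truth  = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.1]

pos = [s for s, t in zip(scores, truth) if t == 1]
neg = [s for s, t in zip(scores, truth) if t == 0]

# Fraction of positive/negative pairs ranked correctly (ties count half).
pairs = [(p, n) for p in pos for n in neg]
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
print(auc)
```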

#Naming
lr_roc_data <- roc_curve(mydata.lr.pred.values.test, truth, .pred_0)
lr_roc_cv_data <- roc_curve(mydata_glm_cv_preds, cohort, .pred_0)
rf_roc_data <- roc_curve(rf_predicted, truth, .pred_0)
rf_roc_cv_data <- rf_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0)
bt_roc_data <- roc_curve(my.bt.pred, truth, .pred_0)
bt_roc_cv_data <- bt_fit_cv |>   # fixed: previously overwrote bt_roc_data
  collect_predictions() |>
  roc_curve(cohort, .pred_0)

#ggplot: label each curve by model so colors map reliably by name
ggplot() +
  geom_path(data = lr_roc_data, aes(x = 1 - specificity,
                                    y = sensitivity,
                                    color = "LR")) +
  geom_path(data = lr_roc_cv_data, aes(x = 1 - specificity,
                                       y = sensitivity,
                                       color = "LR - 10-fold CV")) +
  geom_path(data = rf_roc_data, aes(x = 1 - specificity,
                                    y = sensitivity,
                                    color = "RF")) +
  geom_path(data = rf_roc_cv_data, aes(x = 1 - specificity,
                                       y = sensitivity,
                                       color = "RF - 10-fold CV")) +
  geom_path(data = bt_roc_data, aes(x = 1 - specificity,
                                    y = sensitivity,
                                    color = "GT")) +
  geom_path(data = bt_roc_cv_data, aes(x = 1 - specificity,
                                       y = sensitivity,
                                       color = "GT - 10-fold CV")) +
  labs(
    title = "ROC Curves: Test Set and 10-Fold Cross-Validation",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  scale_color_manual(values = c("LR" = "blue",
                                "LR - 10-fold CV" = "lightblue",
                                "RF" = "pink",
                                "RF - 10-fold CV" = "red",
                                "GT" = "purple",
                                "GT - 10-fold CV" = "green")) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

Discussion

The RF model demonstrated the highest overall performance, with an F1-score of 0.8936, a recall of 0.9276, and the highest AUC of 0.8969, indicating satisfactory discriminatory ability. The XGBoost model achieved the highest precision (0.8669) but a slightly lower AUC (0.8780) than the other models. The 10-fold ROC curves (light blue for LR, red for RF, and green for GT) highlight the models' performance across different subsets of the data. While RF shows the best overall sensitivity across a range of false positive rates, GT underperformed relative to RF and LR, as evidenced by its lower AUC despite visually appearing to follow RF's trajectory. LR, although slightly less sensitive than RF in some regions, exhibited more consistent performance than GT. These findings indicate that RF offers the best overall predictive performance, while GT may be less reliable than LR and RF. Across the 10 folds of the random forest model, fold-level AUC values range from 0.869 to 0.927, indicating consistently strong performance with only minor variability; the fold curves also cluster closely around the overall curve, which suggests the model is stable.

This model also carries clinical and research implications. In practice, it enables clinicians to identify high-risk patients early, which can support clinical decision-making. In addition, feature importance analysis identified the top three factors contributing to the outcome (OUD); interventions should therefore be developed to target these factors. Interestingly, these results align with certain findings in genetics, particularly the shared genetic risk factors between nicotine dependence and OUD. This study also highlights the role of mental defeat, a factor not yet incorporated into existing OUD models, which should be explored further in future research. Finally, this work lays the foundation for future research by providing a strong predictive framework to support the development of more advanced machine learning models.

Limitations

The results of these analyses should be interpreted with caution. First, because of the genotypic objective of the parent study, all participants were Caucasian, limiting the generalizability of the findings to more diverse populations. Second, a substantial amount of missing data was observed in two key factors within this model. Despite this, these variables are considered clinically highly relevant to the outcome, and their imputation does not appear to have adversely affected the model's performance, as discussed previously. Lastly, this model was developed on a relatively small sample. However, it was specifically tailored to chronic pain patients on long-term opioid therapy, a population at higher risk of developing OUD than the general population. Because the model incorporates several significant risk factors that are often excluded from existing machine learning models, it offers meaningful insights and potential contributions to the field.

Conclusion

Random forest achieved the best predictive performance. Feature importance analysis identified nicotine dependence, age, and mental defeat as the top three significant features. By demonstrating satisfactory predictive performance, this model not only offers a strong predictive framework but also lays the groundwork for more advanced machine learning models that make use of a broader range of variables, ultimately improving early identification and intervention strategies for at-risk patients.